Walmart Product Price Prediction¶

Table of contents ¶

  1. Objectives
  2. Short Data Description
  3. Metadata
  4. Import Libraries
  5. Import Dataset
  6. Data Cleaning
  7. Exploratory Data Analysis
  8. Feature Engineering
  9. Modeling

Group G

  • Dev Makwana (885064)
  • Joel Crasto (883863)
  • Krina Patel (886861)
  • Mahaveersinh Chauhan (884854)
  • Trushna Patel (886910)

Objectives ¶

  • The fundamental goal of this project is to develop an accurate and effective regression model for predicting the prices of Walmart grocery products. In addition, exploratory data analysis is used to reveal insights, guided by the following questions:

  • What is the average price of products in Walmart Grocery?

  • What is the average product size (in terms of weight) in Walmart Grocery?

  • What are the most popular brands in Walmart Grocery?

  • What are the most popular products in Walmart Grocery?

  • Can the price of the products be predicted accurately?

  • How do department-wise sales compare across different Walmart stores?

Short Data Description ¶

  • For this project we used the “Walmart Product” dataset, which is openly available on Kaggle. It contains multiple columns such as product size, product category, shipping location, and more.

Metadata ¶

| Field Name | Description | Example |
| --- | --- | --- |
| SHIPPING_LOCATION | The location from which the product is shipped. | 79936 |
| DEPARTMENT | The department in which the product is categorized. | Deli |
| CATEGORY | The category in which the product is categorized. | Hummus, Dips, & Salsa |
| SUBCATEGORY | The subcategory in which the product is categorized. | White Wine |
| BREADCRUMBS | The breadcrumbs for the product. | Deli/Hummus, Dips, & Salsa |
| SKU | The SKU for the product. | 110895339 |
| PRODUCT_URL | The URL for the product. | https://www.walmart.com/ip/Marketside-Roasted-Red-Pepper-Hummus-10-Oz/110895339?fulfillmentIntent=Pi... |
| PRODUCT_NAME | The name of the product. | Marketside Roasted Red Pepper Hummus, 10 Oz |
| BRAND | The brand of the product. | Marketside |
| PRICE_RETAIL | The retail price of the product. | 2.67 |
| PRICE_CURRENT | The current price of the product. | 2.67 |
| PRODUCT_SIZE | The size of the product. | 10 |
| PROMOTION | The promotion for the product. | NULL |
| RunDate | The date on which the data was collected. | 2022-09-11 21:20:04 |
| Tid | Transaction ID. | 16163804 |

Import Libraries ¶

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly
import plotly.io as pio

from sklearn.decomposition import IncrementalPCA
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV
from sklearn.model_selection import GridSearchCV, KFold

import math

Import Dataset ¶

In [2]:
dataset = pd.read_csv("WMT_Grocery_202209.csv", index_col=0)
c:\users\chauh\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py:3553: DtypeWarning: Columns (4) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
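The DtypeWarning above can be avoided by declaring column dtypes up front or setting `low_memory=False`. A minimal sketch on an in-memory stand-in for the real file (the mixed numeric/text content of PRODUCT_SIZE is an assumption inferred from the warning):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for WMT_Grocery_202209.csv: the
# PRODUCT_SIZE column mixes numbers and text, which triggers the
# DtypeWarning under pandas' chunked type inference.
csv = io.StringIO("index,PRODUCT_SIZE\n0,10\n1,750ml\n")

# Passing an explicit dtype (and/or low_memory=False) avoids the warning.
df = pd.read_csv(csv, index_col=0,
                 dtype={"PRODUCT_SIZE": "string"},
                 low_memory=False)
print(df["PRODUCT_SIZE"].dtype)  # string
```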
In [3]:
dataset.head()
Out[3]:
SHIPPING_LOCATION DEPARTMENT CATEGORY SUBCATEGORY BREADCRUMBS SKU PRODUCT_URL PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE PROMOTION RunDate tid
index
0 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 110895339 https://www.walmart.com/ip/Marketside-Roasted-... Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163804
1 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 105455228 https://www.walmart.com/ip/Marketside-Roasted-... Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163805
2 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 128642379 https://www.walmart.com/ip/Marketside-Classic-... Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163806
3 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 366126367 https://www.walmart.com/ip/Marketside-Everythi... Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163807
4 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 160090316 https://www.walmart.com/ip/Price-s-Jalapeno-Di... Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12 NaN 2022-09-11 21:20:04 16163808
  • Below is the statistical description of all the numeric features of the dataset.
In [4]:
dataset.describe()
Out[4]:
SHIPPING_LOCATION SKU PRICE_RETAIL PRICE_CURRENT PROMOTION tid
count 568534.000000 5.685340e+05 568534.000000 568534.000000 0.0 5.685340e+05
mean 57713.149935 2.453284e+08 5.105377 5.079274 NaN 1.644807e+07
std 24113.969156 3.049375e+08 4.824906 4.727971 NaN 1.641218e+05
min 6010.000000 8.795550e+05 0.160000 0.160000 NaN 1.616380e+07
25% 33647.000000 1.555602e+07 2.480000 2.480000 NaN 1.630594e+07
50% 63376.000000 5.003978e+07 3.880000 3.880000 NaN 1.644807e+07
75% 77449.000000 4.448820e+08 5.980000 5.980000 NaN 1.659020e+07
max 96797.000000 1.996767e+09 246.870000 246.870000 NaN 1.673234e+07
  • Following is the basic information of the dataset, e.g., non-null counts and the data type of each column.
In [5]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 568534 entries, 0 to 568533
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SHIPPING_LOCATION  568534 non-null  int64  
 1   DEPARTMENT         568534 non-null  object 
 2   CATEGORY           568534 non-null  object 
 3   SUBCATEGORY        361324 non-null  object 
 4   BREADCRUMBS        568534 non-null  object 
 5   SKU                568534 non-null  int64  
 6   PRODUCT_URL        568534 non-null  object 
 7   PRODUCT_NAME       568534 non-null  object 
 8   BRAND              568507 non-null  object 
 9   PRICE_RETAIL       568534 non-null  float64
 10  PRICE_CURRENT      568534 non-null  float64
 11  PRODUCT_SIZE       505709 non-null  object 
 12  PROMOTION          0 non-null       float64
 13  RunDate            568534 non-null  object 
 14  tid                568534 non-null  int64  
dtypes: float64(3), int64(3), object(9)
memory usage: 69.4+ MB
  • Below is the sum of null values with respect to each column.
In [6]:
dataset.isnull().sum()
Out[6]:
SHIPPING_LOCATION         0
DEPARTMENT                0
CATEGORY                  0
SUBCATEGORY          207210
BREADCRUMBS               0
SKU                       0
PRODUCT_URL               0
PRODUCT_NAME              0
BRAND                    27
PRICE_RETAIL              0
PRICE_CURRENT             0
PRODUCT_SIZE          62825
PROMOTION            568534
RunDate                   0
tid                       0
dtype: int64
In [7]:
dataset
Out[7]:
SHIPPING_LOCATION DEPARTMENT CATEGORY SUBCATEGORY BREADCRUMBS SKU PRODUCT_URL PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE PROMOTION RunDate tid
index
0 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 110895339 https://www.walmart.com/ip/Marketside-Roasted-... Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163804
1 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 105455228 https://www.walmart.com/ip/Marketside-Roasted-... Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163805
2 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 128642379 https://www.walmart.com/ip/Marketside-Classic-... Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163806
3 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 366126367 https://www.walmart.com/ip/Marketside-Everythi... Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10 NaN 2022-09-11 21:20:04 16163807
4 79936 Deli Hummus, Dips, & Salsa NaN Deli/Hummus, Dips, & Salsa 160090316 https://www.walmart.com/ip/Price-s-Jalapeno-Di... Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12 NaN 2022-09-11 21:20:04 16163808
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
568529 70072 Alcohol Wine White Wine Alcohol/Wine 593600139 https://www.walmart.com/ip/Farm-Fresh-Blueberr... Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750 NaN 2022-09-11 21:20:04 16732333
568530 70072 Alcohol Wine White Wine Alcohol/Wine 333403243 https://www.walmart.com/ip/Farm-Fresh-Peach-Mo... Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750 NaN 2022-09-11 21:20:04 16732334
568531 70072 Alcohol Wine White Wine Alcohol/Wine 526588325 https://www.walmart.com/ip/Farm-Fresh-Raspberr... Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750 NaN 2022-09-11 21:20:04 16732335
568532 70072 Alcohol Wine White Wine Alcohol/Wine 286992782 https://www.walmart.com/ip/Farm-Fresh-Mango-Mo... Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750 NaN 2022-09-11 21:20:04 16732336
568533 70072 Alcohol Wine White Wine Alcohol/Wine 160015930 https://www.walmart.com/ip/Ole-Orleans-Heritag... Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750 NaN 2022-09-11 21:20:04 16732337

568534 rows × 15 columns

Data Cleaning ¶

In [8]:
cleaned_dataset = dataset.copy()
In [9]:
dropping_columns = ["RunDate", "SKU", "PROMOTION", "tid", "PRODUCT_URL", "SUBCATEGORY"]
  • Columns such as "RunDate", "SKU", "PROMOTION", "tid", "PRODUCT_URL", and "SUBCATEGORY", which are not needed for further data analysis, are dropped here.
In [10]:
cleaned_dataset.drop(dropping_columns, axis=1, inplace=True)
In [11]:
cleaned_dataset
Out[11]:
SHIPPING_LOCATION DEPARTMENT CATEGORY BREADCRUMBS PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE
index
0 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10
1 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10
2 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10
3 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10
4 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12
... ... ... ... ... ... ... ... ... ...
568529 70072 Alcohol Wine Alcohol/Wine Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750
568530 70072 Alcohol Wine Alcohol/Wine Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750
568531 70072 Alcohol Wine Alcohol/Wine Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750
568532 70072 Alcohol Wine Alcohol/Wine Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750
568533 70072 Alcohol Wine Alcohol/Wine Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750

568534 rows × 9 columns

  • The PRODUCT_SIZE feature is converted from string to a numeric data type.
In [12]:
cleaned_dataset["PRODUCT_SIZE"] = pd.to_numeric(cleaned_dataset["PRODUCT_SIZE"],errors='coerce')
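With `errors='coerce'`, any value that cannot be parsed becomes `NaN` instead of raising, which is why the null count for PRODUCT_SIZE grows in the next cells. A minimal illustration on hypothetical values:

```python
import pandas as pd

# Hypothetical size strings; "750ml" cannot be parsed as a number.
sizes = pd.Series(["10", "12", "750ml", None])

# Unparseable strings are coerced to NaN rather than raising an error.
numeric = pd.to_numeric(sizes, errors="coerce")
print(numeric.tolist())  # [10.0, 12.0, nan, nan]
```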
In [13]:
cleaned_dataset
Out[13]:
SHIPPING_LOCATION DEPARTMENT CATEGORY BREADCRUMBS PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE
index
0 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10.0
1 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10.0
2 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10.0
3 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10.0
4 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12.0
... ... ... ... ... ... ... ... ... ...
568529 70072 Alcohol Wine Alcohol/Wine Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
568530 70072 Alcohol Wine Alcohol/Wine Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750.0
568531 70072 Alcohol Wine Alcohol/Wine Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
568532 70072 Alcohol Wine Alcohol/Wine Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
568533 70072 Alcohol Wine Alcohol/Wine Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750.0

568534 rows × 9 columns

In [14]:
cleaned_dataset.isnull().sum()
Out[14]:
SHIPPING_LOCATION        0
DEPARTMENT               0
CATEGORY                 0
BREADCRUMBS              0
PRODUCT_NAME             0
BRAND                   27
PRICE_RETAIL             0
PRICE_CURRENT            0
PRODUCT_SIZE         63560
dtype: int64
Handling Null values¶
  • In the following cell, we drop all rows containing null values.
  • Given how many values are missing, we could not find an effective method to impute them without introducing bias and noise into the data.
  • Moreover, we have enough data that dropping the rows with null values will hardly affect the model or further analysis.
In [15]:
cleaned_dataset.dropna(inplace=True)
In [16]:
cleaned_dataset.isnull().sum()
Out[16]:
SHIPPING_LOCATION    0
DEPARTMENT           0
CATEGORY             0
BREADCRUMBS          0
PRODUCT_NAME         0
BRAND                0
PRICE_RETAIL         0
PRICE_CURRENT        0
PRODUCT_SIZE         0
dtype: int64
In [17]:
cleaned_dataset.reset_index(drop=True, inplace=True)
In [18]:
cleaned_dataset
Out[18]:
SHIPPING_LOCATION DEPARTMENT CATEGORY BREADCRUMBS PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE
0 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10.0
1 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10.0
2 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10.0
3 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10.0
4 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12.0
... ... ... ... ... ... ... ... ... ...
504969 70072 Alcohol Wine Alcohol/Wine Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504970 70072 Alcohol Wine Alcohol/Wine Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750.0
504971 70072 Alcohol Wine Alcohol/Wine Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504972 70072 Alcohol Wine Alcohol/Wine Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504973 70072 Alcohol Wine Alcohol/Wine Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750.0

504974 rows × 9 columns

Exploratory Data Analysis ¶

In [19]:
eda_df = cleaned_dataset.copy()
In [20]:
eda_df
Out[20]:
SHIPPING_LOCATION DEPARTMENT CATEGORY BREADCRUMBS PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE
0 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10.0
1 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10.0
2 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10.0
3 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10.0
4 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12.0
... ... ... ... ... ... ... ... ... ...
504969 70072 Alcohol Wine Alcohol/Wine Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504970 70072 Alcohol Wine Alcohol/Wine Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750.0
504971 70072 Alcohol Wine Alcohol/Wine Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504972 70072 Alcohol Wine Alcohol/Wine Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504973 70072 Alcohol Wine Alcohol/Wine Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750.0

504974 rows × 9 columns

In [21]:
corr_train = eda_df.corr()
fig = px.imshow(corr_train, text_auto=True, aspect="auto", color_continuous_scale='RdBu_r')
fig.update_layout(title_text='Correlation Matrix', title_x=0.5, width=950, height=800)
fig.show()

What is the average price of products in Walmart Grocery?¶

In [24]:
avg_price = eda_df.groupby(["DEPARTMENT"]).agg({"PRICE_RETAIL" : "mean"}).reset_index()
fig = px.bar(avg_price, x='DEPARTMENT', y='PRICE_RETAIL',text_auto='.2s', labels={"DEPARTMENT" : "Department", "PRICE_RETAIL" : "Price"})
fig.update_layout(title_text='Average Price By Department', title_x=0.5, width=1000, height=650)
fig.show()

What is the average product size in Walmart Grocery?¶

In [25]:
avg_size = eda_df.groupby(["DEPARTMENT"]).agg({"PRODUCT_SIZE" : "mean"}).reset_index()
fig = px.bar(avg_size, x='DEPARTMENT', y='PRODUCT_SIZE',text_auto='.2s', labels={"DEPARTMENT" : "Department", "PRODUCT_SIZE" : "Product Size"})
fig.update_layout(title_text='Average Product Size By Department', title_x=0.5, width=1000, height=650)
fig.show()

What are the most popular brands in Walmart Grocery?¶

In [71]:
popular_brand = pd.DataFrame(eda_df["BRAND"].value_counts()).reset_index().sort_values(by=["BRAND"], ascending=False)
fig = px.pie(popular_brand[:10], values='BRAND', names='index')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title_text='Top 10 Popular Brands', title_x=0.45)
fig.show()

What are the most popular products in Walmart Grocery?¶

In [67]:
popular_products = pd.DataFrame(eda_df["PRODUCT_NAME"].value_counts()).reset_index().sort_values(by=["PRODUCT_NAME"], ascending=False)
fig = px.bar(popular_products[:20], x='index', y='PRODUCT_NAME',text_auto='.2s', labels={"PRODUCT_NAME" : "Product Count", "index" : "Product Name"})
fig.update_layout(title_text='Top 20 Popular Products', title_x=0.5, width=1000, height=650)
fig.update_xaxes(tickangle = 45,automargin = False)
fig.show()

How do department-wise prices compare across different Walmart stores?¶

The following stacked bar chart shows the department-wise total of prices at each shipping location. The Baking and Breakfast & Cereal departments show the highest product prices, while Fresh Produce and Meat & Seafood show the lowest.

In [73]:
price_by_shipping_location = eda_df.groupby(["SHIPPING_LOCATION", "DEPARTMENT"]).agg({"PRICE_RETAIL" : "sum"}).reset_index()
price_by_shipping_location["SHIPPING_LOCATION"] = price_by_shipping_location["SHIPPING_LOCATION"].apply(lambda x : str(x))
fig = px.bar(price_by_shipping_location, x='SHIPPING_LOCATION', y='PRICE_RETAIL', color="DEPARTMENT", labels={"SHIPPING_LOCATION" : "Store Locations", "PRICE_RETAIL" : "Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Total Price by Store Location corresponding to different Departments', title_x=0.5, width=1000, height=650)
fig.show()

The pie chart below shows each department's share of the product count. The Pantry and Breakfast & Cereal departments have the most products, while Fresh Produce and Alcohol have the fewest.

In [45]:
count_by_department = eda_df.groupby("DEPARTMENT").agg({'PRODUCT_NAME':'count'}).reset_index()
fig = px.pie(count_by_department, values='PRODUCT_NAME', names='DEPARTMENT', hole=.5, color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(annotations=[dict(text='Department', x=0.5, y=0.5, font_size=30, showarrow=False)])
fig.update_layout(title_text='Percentage of Product counts by Department', title_x=0.45)
fig.show()

The Meat & Seafood department records the highest total price, at around 200k, followed by the Breakfast & Cereal department at around 140k.

In [46]:
top_20_category = eda_df.groupby(["CATEGORY", "DEPARTMENT"]).agg({"PRICE_RETAIL" : "sum"}).reset_index()
top_20_category = top_20_category.sort_values(by=["PRICE_RETAIL"], ascending=False)
fig = px.bar(top_20_category[:20], x='CATEGORY', y='PRICE_RETAIL', color="DEPARTMENT", text_auto='.2s', labels={"CATEGORY" : "Category", "PRICE_RETAIL" : "Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Total Price of Top 20 Categories', title_x=0.5, width=950, height=650)
fig.show()

The chart below shows the 20 brands with the highest average product price.

In [47]:
top_20_category = eda_df.groupby(["BRAND"]).agg({"PRICE_RETAIL" : "mean"}).reset_index()
top_20_category = top_20_category.sort_values(by=["PRICE_RETAIL"], ascending=False)
fig = px.bar(top_20_category[:20], x='BRAND', y='PRICE_RETAIL', text_auto='.2s', labels={"BRAND" : "Brand", "PRICE_RETAIL" : "Price"})
fig.update_layout(title_text='Average Price of Top 20 Brands', title_x=0.5, width=950, height=650)
fig.show()

The graph shows that the price distribution is right-skewed rather than normal, so we transform it toward a normal distribution.

In [48]:
fig = px.histogram(eda_df, x="PRICE_RETAIL", color="DEPARTMENT", nbins=50, labels={"DEPARTMENT" : "Department", "PRICE_RETAIL" : "Price", "count" : "Count"})
fig.update_layout(title_text='Price Distribution by Department', title_x=0.5, width=950, height=650)
fig.show()

To bring the prices closer to a normal distribution, we apply a log transformation.

In [49]:
logged_df = pd.DataFrame()
logged_df["DEPARTMENT"] = eda_df[["DEPARTMENT"]]
logged_df["LOG_PRICE"] = eda_df["PRICE_RETAIL"].apply(lambda x : np.log(x) if np.log(x) > 0 else 0)

fig = px.histogram(logged_df, x="LOG_PRICE", color="DEPARTMENT", nbins=50, labels={"count" : "Value Count", "LOG_PRICE" : "Log Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Logarithmic Price Distribution by Department', title_x=0.5, width=950, height=650)
fig.show()
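Note that `np.log` is negative for prices below $1 (the dataset minimum is 0.16), which is why the cell above clips negative logs to 0. A monotone alternative that stays non-negative for non-negative inputs is `log1p`, with `expm1` as its exact inverse; a small sketch:

```python
import numpy as np

# Hypothetical prices spanning the dataset's observed range (0.16 to 246.87).
prices = np.array([0.16, 2.67, 246.87])

# log1p(x) = log(1 + x) is >= 0 for x >= 0, so sub-dollar prices
# need no clipping, and expm1 recovers the original scale exactly.
logged = np.log1p(prices)
recovered = np.expm1(logged)
print(np.allclose(recovered, prices))  # True
```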

Feature Engineering ¶

In [31]:
featured_dataset = cleaned_dataset.copy()
In [32]:
featured_dataset
Out[32]:
SHIPPING_LOCATION DEPARTMENT CATEGORY BREADCRUMBS PRODUCT_NAME BRAND PRICE_RETAIL PRICE_CURRENT PRODUCT_SIZE
0 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Red Pepper Hummus, 10 Oz Marketside 2.67 2.67 10.0
1 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Roasted Garlic Hummus, 10 Oz Marketside 2.67 2.67 10.0
2 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Classic Hummus, 10 Oz Marketside 2.67 2.67 10.0
3 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Marketside Everything Hummus, 10 oz Marketside 2.67 2.67 10.0
4 79936 Deli Hummus, Dips, & Salsa Deli/Hummus, Dips, & Salsa Price's Jalapeno Dip, 12 Oz. Price's 3.12 3.12 12.0
... ... ... ... ... ... ... ... ... ...
504969 70072 Alcohol Wine Alcohol/Wine Farm Fresh Blueberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504970 70072 Alcohol Wine Alcohol/Wine Farm Fresh Peach Moscato 750 Ml Farm Fresh Wine Company 9.98 9.98 750.0
504971 70072 Alcohol Wine Alcohol/Wine Farm Fresh Raspberry Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504972 70072 Alcohol Wine Alcohol/Wine Farm Fresh Mango Moscato 750ml Farm Fresh Wine Company 9.98 9.98 750.0
504973 70072 Alcohol Wine Alcohol/Wine Ole Orleans Heritage Riesling 750ml Ole Orleans 18.98 18.98 750.0

504974 rows × 9 columns

In [33]:
redundant_columns = ["SHIPPING_LOCATION", "BREADCRUMBS", "PRODUCT_NAME", "PRICE_CURRENT"]
  • In the cell below, we drop redundant features to avoid issues in the modeling stage; PRICE_CURRENT in particular nearly duplicates the target PRICE_RETAIL.
In [34]:
featured_dataset.drop(redundant_columns, axis=1, inplace=True)
In [35]:
featured_dataset
Out[35]:
DEPARTMENT CATEGORY BRAND PRICE_RETAIL PRODUCT_SIZE
0 Deli Hummus, Dips, & Salsa Marketside 2.67 10.0
1 Deli Hummus, Dips, & Salsa Marketside 2.67 10.0
2 Deli Hummus, Dips, & Salsa Marketside 2.67 10.0
3 Deli Hummus, Dips, & Salsa Marketside 2.67 10.0
4 Deli Hummus, Dips, & Salsa Price's 3.12 12.0
... ... ... ... ... ...
504969 Alcohol Wine Farm Fresh Wine Company 9.98 750.0
504970 Alcohol Wine Farm Fresh Wine Company 9.98 750.0
504971 Alcohol Wine Farm Fresh Wine Company 9.98 750.0
504972 Alcohol Wine Farm Fresh Wine Company 9.98 750.0
504973 Alcohol Wine Ole Orleans 18.98 750.0

504974 rows × 5 columns

Handling Categorical Data¶
  • Now, we convert the categorical data into one-hot encoded format so these features can be fed to the model.
In [36]:
dummy_df = pd.get_dummies(featured_dataset[['DEPARTMENT','CATEGORY', 'BRAND']])
In [37]:
dummy_df
Out[37]:
DEPARTMENT_Alcohol DEPARTMENT_Bakery & Bread DEPARTMENT_Baking DEPARTMENT_Beverages DEPARTMENT_Breakfast & Cereal DEPARTMENT_Candy DEPARTMENT_Coffee DEPARTMENT_Dairy & Eggs DEPARTMENT_Deli DEPARTMENT_Fresh Produce ... BRAND_natural bliss BRAND_neuro drinks BRAND_new BRAND_noosa BRAND_nutpods BRAND_popchips BRAND_sea_best BRAND_simply... BRAND_vibi+ BRAND_vitaminwater
0 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
504969 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
504970 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
504971 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
504972 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
504973 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

504974 rows × 4016 columns
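On a small frame, `get_dummies` expands each categorical column into one 0/1 indicator column per distinct level, which is how three categorical columns become 4016 columns above. A toy illustration with hypothetical values:

```python
import pandas as pd

# Toy categorical column with two distinct levels.
df = pd.DataFrame({"DEPARTMENT": ["Deli", "Alcohol", "Deli"]})

# Each level becomes its own indicator column, named column_level.
dummies = pd.get_dummies(df)
print(list(dummies.columns))  # ['DEPARTMENT_Alcohol', 'DEPARTMENT_Deli']
```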

Applying PCA on Dummy data to reduce features¶
  • Applying get_dummies to all the categorical features produces 4016 features, far too many to fit into a machine learning model efficiently.
  • So, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the features.
  • We reduce the 4016 features to 25 components while retaining most of the variance.
In [38]:
pca = PCA(n_components=25)  # Choose the number of components you want to retain
pca_result = pca.fit_transform(dummy_df)
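It is worth verifying how much variance the 25 components actually retain via `explained_variance_ratio_`. A sketch on synthetic one-hot data, since the real `dummy_df` is too large to reproduce here (the synthetic shape and category count are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic one-hot matrix standing in for dummy_df: 1000 rows,
# 50 categories, exactly one active column per row.
onehot = np.eye(50)[rng.integers(0, 50, size=1000)]

pca = PCA(n_components=25)
pca.fit(onehot)

# Fraction of total variance retained by the 25 components; if this
# is low, n_components should be raised.
retained = pca.explained_variance_ratio_.sum()
print(round(retained, 3))
```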
In [39]:
pca_df = pd.DataFrame(data=pca_result)
In [40]:
pca_df
Out[40]:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 -0.419694 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
1 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 -0.419694 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
2 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 -0.419694 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
3 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 -0.419694 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
4 -0.162840 -0.188059 -0.083084 -0.134436 -0.061704 -0.007414 0.020106 0.262681 0.380736 -0.480189 ... -0.003061 -0.121335 0.006873 0.027117 0.004595 0.003790 0.004026 -0.003033 -0.003043 0.002084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
504969 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 -0.093395 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504970 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 -0.093395 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504971 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 -0.093395 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504972 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 -0.093395 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504973 -0.149853 -0.174627 -0.091233 -0.131267 -0.057705 -0.013895 -0.000124 0.149477 0.084634 -0.093337 ... -0.007454 -0.214887 -0.001550 -0.053231 0.006210 0.032816 0.003407 0.010243 -0.019743 0.067104

504974 rows × 25 columns

In [41]:
featured_dataset.drop(["DEPARTMENT", "CATEGORY", "BRAND"], axis=1, inplace=True)
  • Concatenating the PCA components with the remaining numeric features.
In [42]:
featured_dataset = pd.concat([featured_dataset, pca_df], axis=1)
In [43]:
featured_dataset
Out[43]:
PRICE_RETAIL PRODUCT_SIZE 0 1 2 3 4 5 6 7 ... 15 16 17 18 19 20 21 22 23 24
0 2.67 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
1 2.67 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
2 2.67 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
3 2.67 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
4 3.12 12.0 -0.162840 -0.188059 -0.083084 -0.134436 -0.061704 -0.007414 0.020106 0.262681 ... -0.003061 -0.121335 0.006873 0.027117 0.004595 0.003790 0.004026 -0.003033 -0.003043 0.002084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
504969 9.98 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504970 9.98 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504971 9.98 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504972 9.98 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504973 18.98 750.0 -0.149853 -0.174627 -0.091233 -0.131267 -0.057705 -0.013895 -0.000124 0.149477 ... -0.007454 -0.214887 -0.001550 -0.053231 0.006210 0.032816 0.003407 0.010243 -0.019743 0.067104

504974 rows × 27 columns

Modeling ¶

In [44]:
train_test_dataset = featured_dataset.copy()
  • Here we split the data into X (features) and y (target).
In [45]:
X = train_test_dataset.drop(["PRICE_RETAIL"], axis=1)
In [46]:
y = train_test_dataset[["PRICE_RETAIL"]]
In [47]:
X
Out[47]:
PRODUCT_SIZE 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23 24
0 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
1 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
2 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
3 10.0 -0.189984 -0.248196 -0.118577 -0.067075 -0.066656 -0.017657 0.015691 0.317993 0.514401 ... 0.004211 0.458439 -0.008684 -0.493213 0.026586 0.030831 -0.056889 0.047118 -0.076353 0.162100
4 12.0 -0.162840 -0.188059 -0.083084 -0.134436 -0.061704 -0.007414 0.020106 0.262681 0.380736 ... -0.003061 -0.121335 0.006873 0.027117 0.004595 0.003790 0.004026 -0.003033 -0.003043 0.002084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
504969 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504970 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504971 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504972 750.0 -0.149875 -0.174667 -0.091255 -0.131305 -0.057724 -0.013900 -0.000124 0.149545 0.084681 ... -0.007462 -0.215112 -0.001552 -0.053304 0.006218 0.032867 0.003413 0.010249 -0.019786 0.067232
504973 750.0 -0.149853 -0.174627 -0.091233 -0.131267 -0.057705 -0.013895 -0.000124 0.149477 0.084634 ... -0.007454 -0.214887 -0.001550 -0.053231 0.006210 0.032816 0.003407 0.010243 -0.019743 0.067104

504974 rows × 26 columns

In [48]:
y
Out[48]:
PRICE_RETAIL
0 2.67
1 2.67
2 2.67
3 2.67
4 3.12
... ...
504969 9.98
504970 9.98
504971 9.98
504972 9.98
504973 18.98

504974 rows × 1 columns

In [49]:
y = np.log(y)
In [50]:
y
Out[50]:
PRICE_RETAIL
0 0.982078
1 0.982078
2 0.982078
3 0.982078
4 1.137833
... ...
504969 2.300583
504970 2.300583
504971 2.300583
504972 2.300583
504973 2.943386

504974 rows × 1 columns
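The target is log-transformed before modeling because retail prices are heavily right-skewed (a few large-pack items next to many sub-$10 items), so errors on the log scale penalize relative rather than absolute differences. Any prediction therefore comes back as a log-price and must be inverted with np.exp. A minimal sketch on toy prices, not the dataset:

```python
import numpy as np

# toy prices spanning the skewed range seen in the data
prices = np.array([2.67, 3.12, 9.98, 18.98, 750.0])

log_prices = np.log(prices)     # compresses the long right tail
recovered = np.exp(log_prices)  # inverts the transform after prediction

# the round trip is lossless up to floating-point precision
assert np.allclose(recovered, prices)
```

On the log scale the roughly 281x spread between the cheapest and most expensive toy item shrinks to a difference of about 5.6, which keeps large prices from dominating a squared-error loss.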

Baseline Models¶

  • Here the base_models function applies the baseline models to check the performance of simple models on the dataset.
  • The data has been split into training and testing sets with an 80:20 ratio.
  • As all of the other features are already normalized between 0 and 1, the product size feature has been scaled with MinMaxScaler to keep the feature ranges comparable and ensure good model performance.
  • Additionally, we report MSE, MAE, RMSE, and R2 scores for training and testing.
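The scaling step described above can be sketched in isolation; the sizes below are toy values, not the dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sizes = np.array([[10.0], [12.0], [750.0]])  # toy PRODUCT_SIZE column

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(sizes)  # (x - min) / (max - min) per column

# 10 -> 0.0, 12 -> 2/740, 750 -> 1.0
assert scaled.min() == 0.0 and scaled.max() == 1.0
```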
In [62]:
def base_models(X, y, test_size=0.2, random_state=42):

    models = {
        'Linear Regression': LinearRegression(),
        'Lasso Regression': LassoCV(random_state=random_state),
        # labelled "Decision Tree" in the results, but a RandomForestRegressor is fitted
        'Decision Tree': RandomForestRegressor(random_state=random_state)
    }

    # scale PRODUCT_SIZE to [0, 1] so it matches the range of the other features
    Mscaler = MinMaxScaler(feature_range=(0, 1))
    X[25] = pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
    X.drop(['PRODUCT_SIZE'], axis=1, inplace=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    evaluation_results = {}

    for model_name, model in models.items():
        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)
        y_pred_train = model.predict(X_train)
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        r2_t = r2_score(y_train, y_pred_train)

        # Store evaluation results in a dictionary
        evaluation_results[model_name] = {
            'MSE': mse,
            'MAE': mae,
            'RMSE': rmse,
            'R2 test': r2,
            'R2 train': r2_t
        }

    return evaluation_results
In [63]:
base_models(X, y, test_size=0.2, random_state=42)
C:\Users\dmaka\anaconda3\envs\Tflow1\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:1568: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\dmaka\AppData\Local\Temp\ipykernel_7368\131706591.py:18: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

Out[63]:
{'Linear Regression': {'MSE': 0.38256908001293427,
  'MAE': 0.47927570750897064,
  'RMSE': 0.618521689201708,
  'R2 test': 0.214680528116167,
  'R2 train': 0.21564240024117065},
 'Lasso Regression': {'MSE': 0.38259045806527975,
  'MAE': 0.4791587849682495,
  'RMSE': 0.6185389705307821,
  'R2 test': 0.21463664427490803,
  'R2 train': 0.21553497121456755},
 'Decision Tree': {'MSE': 0.046740671160948113,
  'MAE': 0.09572179969210073,
  'RMSE': 0.21619590921418497,
  'R2 test': 0.9040529904027511,
  'R2 train': 0.9139985335752849}}
Linear and Lasso Regression:¶
  • Linear and Lasso regression both reach an R2 of roughly 21% on the training and testing sets, with nearly identical scores.

This is expected, as linear models cannot capture non-linear relationships in the data and are not robust to outliers.

Decision Tree:¶
  • Despite being a simple tree-based regressor, it performs far better than the ordinary regression models above, with an R2 of about 91% on the training set and 90% on the testing set.
MSE (Mean Squared Error) & MAE (Mean Absolute Error):¶
  • In comparison to the Linear Regression and Lasso Regression models, the Decision Tree model performs noticeably better in terms of MSE and MAE. The Decision Tree's lower MSE and MAE scores show that its predictions track the actual values much more closely.

  • The higher MSE and MAE values for Linear Regression and Lasso Regression indicate that these models are less successful than the Decision Tree at capturing the underlying patterns in the data.
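The three scores compared above can be reproduced on a toy example to make their roles concrete (the values below are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors; punishes large misses
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors; less outlier-sensitive
r2 = r2_score(y_true, y_pred)              # fraction of target variance explained

# mse ≈ 0.025, mae ≈ 0.15, r2 ≈ 0.98
```

Lower MSE/MAE and higher R2 are better, which is why the tree-based model wins on all three metrics above.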

High Performing Models¶

In [83]:
def evaluate_regression_models(X_, y, test_size=0.2, random_state=42):

    # Initialize the models
    models = {
        'CatBoost': CatBoostRegressor(random_state=random_state, silent=True),
        'XGBoost': XGBRegressor(random_state=random_state, verbosity=0),
        'RandomForest': RandomForestRegressor(random_state=random_state)
    }

    # scale PRODUCT_SIZE to [0, 1] so it matches the range of the other features
    Mscaler = MinMaxScaler(feature_range=(0, 1))
    X_[25] = pd.DataFrame(Mscaler.fit_transform(X_[['PRODUCT_SIZE']]))
    X_ = X_.drop(['PRODUCT_SIZE'], axis=1)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)

    evaluation_results = {}

    for model_name, model in models.items():
        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)
        y_pred_train = model.predict(X_train)
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        r2_t = r2_score(y_train, y_pred_train)

        # Store evaluation results in a dictionary
        evaluation_results[model_name] = {
            'MSE': mse,
            'MAE': mae,
            'R2 test': r2,
            'R2 train': r2_t
        }

    return evaluation_results
In [84]:
evaluate_regression_models(X, y, test_size=0.2, random_state=42)
C:\Users\dmaka\AppData\Local\Temp\ipykernel_7368\2068173715.py:23: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

Out[84]:
{'CatBoost': {'MSE': 0.1257727503994698,
  'MAE': 0.2530568934650479,
  'R2 test': 0.7418197259492341,
  'R2 train': 0.7473200461957551},
 'XGBoost': {'MSE': 0.12138713590923876,
  'MAE': 0.24845567036839641,
  'R2 test': 0.7508223051833893,
  'R2 train': 0.7560006893486932},
 'RandomForest': {'MSE': 0.046740671160948113,
  'MAE': 0.09572179969210073,
  'R2 test': 0.9040529904027511,
  'R2 train': 0.9139985335752849}}
CatBoost Regressor:¶
  • The CatBoost Regressor gives nearly identical training and testing R2 scores of about 74%, which indicates the model fits well and does not overfit.
XGBoost Regressor¶
  • The XGBoost Regressor performs similarly to CatBoost, ahead by only about 1%.
Random Forest Regressor¶
  • The Random Forest Regressor performs the best among all the models mentioned. Its scores match the earlier "Decision Tree" run: about 91% R2 on training and 90% on testing.

Hyper Parameter Tuning using RandomizedSearchCV¶

In [51]:
def hyperparameter_tuning_RandomForest(X, y, test_size=0.2, random_state=42):
    # Define parameter grid for RandomizedSearchCV
    param_grid = {
        'n_estimators': [300, 500, 800, 1000],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }

    # scale PRODUCT_SIZE to [0, 1] so it matches the range of the other features
    Mscaler = MinMaxScaler(feature_range=(0, 1))
    X[25] = pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
    X_ = X.drop(['PRODUCT_SIZE'], axis=1)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)

    # Create a RandomForestRegressor
    RF = RandomForestRegressor(random_state=random_state, n_jobs=-1)

    # Perform RandomizedSearchCV
    rand_search = RandomizedSearchCV(estimator=RF, param_distributions=param_grid, cv=2, n_jobs=-1, verbose=2)

    # Fit the model
    rand_search.fit(X_train, y_train)

    # Print best parameters and score
    print("Best Parameters:", rand_search.best_params_)
    print("Best Score:", rand_search.best_score_)

    # Evaluate on test set
    test_score = rand_search.score(X_test, y_test)
    print("Test Set Score:", test_score)

# Example usage
# hyperparameter_tuning_RandomForest(X, y)
In [52]:
hyperparameter_tuning_RandomForest(X, y, test_size=0.2, random_state=42)
Fitting 2 folds for each of 10 candidates, totalling 20 fits
C:\Users\dmaka\anaconda3\envs\Tflow1\lib\site-packages\sklearn\model_selection\_search.py:909: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

Best Parameters: {'n_estimators': 800, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 30}
Best Score: 0.8999592139431523
Test Set Score: 0.9035918918598134
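Note that RandomizedSearchCV refits the winning parameter combination on the full training split by default (refit=True), so the tuned model can be reused directly instead of re-running the search. A self-contained sketch on synthetic data; the variable names are illustrative, not from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(42)
X_toy = rng.rand(200, 3)
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    n_iter=4, cv=2, random_state=42,
)
search.fit(X_toy, y_toy)

# refit=True retrained the best combination on all of X_toy,
# so best_estimator_ is ready to predict without another fit
best_rf = search.best_estimator_
preds = best_rf.predict(X_toy[:5])
```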
In [58]:
def hyperparameter_tuning_XGBoost(X, y, test_size=0.2, random_state=42):
    # Define parameter grid for RandomizedSearchCV
    param_grid = {
        'n_estimators': [300, 600, 800],
        'max_depth': [3, 5, 10],
        'learning_rate': [0.01, 0.03, 0.05, 0.1],
        'colsample_bytree': [0.8, 0.9, 1.0],
        'gamma': [0, 1, 2]
    }

    # scale PRODUCT_SIZE to [0, 1] so it matches the range of the other features
    Mscaler = MinMaxScaler(feature_range=(0, 1))
    X[25] = pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
    X_ = X.drop(['PRODUCT_SIZE'], axis=1)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)

    # Create an XGBRegressor
    XG = XGBRegressor(random_state=random_state, n_jobs=-1)

    # Perform RandomizedSearchCV
    rand_search = RandomizedSearchCV(estimator=XG, param_distributions=param_grid, cv=2, n_jobs=-1, verbose=2)

    # Fit the model
    rand_search.fit(X_train, y_train)

    # Print best parameters and score
    print("Best Parameters:", rand_search.best_params_)
    print("Best Score:", rand_search.best_score_)

    # Evaluate on test set
    test_score = rand_search.score(X_test, y_test)
    print("Test Set Score:", test_score)
In [59]:
hyperparameter_tuning_XGBoost(X, y, test_size=0.2, random_state=42)
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best Parameters: {'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.9}
Best Score: 0.8618847570379943
Test Set Score: 0.8634410050433886
  • Having applied RandomizedSearchCV, the best parameters found for each model are listed above.

With these parameters, the tuned XGBoost model reaches a training and testing R2 of about 86%, a clear improvement over its untuned score of roughly 75%.

  • Although RandomizedSearchCV was also applied to the Random Forest, it did not improve on the accuracy already achieved in the first run (about 90% on the test set).

  • Out of all the models, the Decision Tree and Random Forest Regressor give the most promising results on both training and testing data.
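One caveat for whichever model is deployed: it was trained on log-prices, so raw predictions must be passed through np.exp before being reported in dollars. A minimal end-to-end sketch on synthetic data; the real pipeline would use the tuned Random Forest and the scaled features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_toy = rng.rand(300, 2)
prices = np.exp(1.0 + 2.0 * X_toy[:, 0])  # toy prices, log-linear in one feature

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_toy, np.log(prices))          # train on the log target, as in the notebook

log_pred = model.predict(X_toy[:5])
dollar_pred = np.exp(log_pred)            # invert the log transform
assert np.all(dollar_pred > 0)            # exp guarantees positive price predictions
```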